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Abstract — Error  Subspace  Statistical  Estimation  (ESSE),  the  uncertainty  prediction  and  data  assimilation  methodology  employed  for 
real-time  ocean  forecasts,  is  based  on  a  characterization  and  prediction  of  the  largest  uncertainties.  This  is  carried  out  by  evolving  an 
error  subspace  of  variable  size.  We  use  an  ensemble  of  stochastic  model  simulations,  initialized  based  on  an  estimate  of  the  dominant 
initial  uncertainties,  to  predict  the  error  subspace  of  the  model  fields.  The  dominant  error  covariance  (generated  via  an  SVD  of  the 
ensemble  generated  error  covariance  matrix)  is  used  for  data  assimilation.  The  resulting  ocean  fields  are  provided  as  the  input  to 
acoustic  modeling,  allowing  for  the  prediction  and  study  of  the  spatiotemporal  variations  in  acoustic  propagation. 

The  ESSE  procedure  is  a  classic  case  of  Many  Task  Computing:  These  codes  are  managed  based  on  dynamic  workflows  for  (i)  the 
perturbation  of  the  initial  mean  state,  (ii)  the  subsequent  ensemble  of  stochastic  PE  model  runs,  (iii)  the  continuous  generation  of 
the  covariance  matrix,  (iv)  the  successive  computations  of  the  SVD  of  the  ensemble  spread  until  a  convergence  criterion  is  satisfied, 
and  (v)  the  data  assimilation.  Its  ensemble  nature  makes  it  a  many  task  data  intensive  application  and  its  dynamic  workflow  gives  it 
heterogeneity.  Subsequent  acoustics  propagation  modeling  involves  a  very  large  ensemble  of  very  short  in  duration  acoustics  runs. 
We  study  the  execution  characteristics  and  challenges  of  a  distributed  ESSE  workflow  on  a  large  dedicated  cluster  and  the  usability  of 
enhancing  this  with  runs  on  Amazon  EC2  and  the  Teragrid  and  the  I/O  challenges  faced. 
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1  Introduction 

UR  initial  motivation  was  speeding  up  the  exe¬ 
cution  of  our  stochastic  ocean  data  assimilation 
ensembles  via  distributed  computations  and  thereby  al¬ 
lowing  the  evaluation  of  larger  ensembles  in  the  same 
amount  of  real  time.  Our  approach  resulted  in  a  clear 
case  of  a  Many  Task  Computing  (MTC)  [1]  application. 

In  what  follows.  Section  2  describes  the  application 
area  of  ocean  data  assimilation  and  provides  details 
about  the  timeline  of  real-time  data  assimilation  and 
ocean-acoustic  modeling.  Section  3  describes  ESSE  [2], 
[3],  the  data  assimilation  and  error  estimation  approach 
used.  Section  4  describes  the  ESSE  implementation  as 
a  MTC  application  and  the  options  we  face  in  terms 
of  optimizing  I/O  issues.  This  is  followed  in  Section  5 
by  a  discussion  of  the  practical  MTC  use  of  ESSE  on 
local  clusters.  Grids  and  Amazon  EC2.  We  then  illustrate 
scientific  results  in  Section  6  and  discussing  future  work 
in  Section  7.  Conclusions  are  in  Section  8. 

2  Ocean  Data  Assimilation 

Data  Assimilation  (DA)  is  a  quantitative  approach  to 
optimally  combine  models  and  observations  that  is  con¬ 
sistent  with  model  and  data  uncertainties.  Ocean  DA 
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can  extract  maximum  knowledge  from  the  sparse  and 
expensive  measurements  of  highly  variable  ocean  dy¬ 
namics.  The  ultimate  goal  is  to  better  understand  and 
predict  these  dynamics  on  multiple  spatial  and  temporal 
scales.  There  are  many  applications  that  involve  DA  or 
build  on  its  results,  including:  coastal,  regional,  seasonal, 
and  inter-annual  ocean  and  climate  dynamics;  carbon 
and  biogeochemical  cycles;  ecosystem  dynamics;  ocean 
engineering;  observing-system  design;  coastal  manage¬ 
ment;  fisheries;  pollution  control;  naval  operations;  and 
defense  and  security.  These  applications  have  different 
requirements  that  lead  to  variations  in  the  DA  schemes 
utilized. 

The  ocean  physics  involves  a  multitude  of  phenom¬ 
ena  occurring  on  multiple  scales,  from  molecular  and 
turbulent  processes  to  decadal  variations  and  climate 
dynamics.  Life  takes  place  in  the  ocean,  from  bacteria 
and  plankton  cells  to  fish  and  mammals.  The  range  of 
space  scales  is  from  about  1  mm  to  10,000  km,  and 
of  time  scales,  from  about  1  s  to  100  years  and  more. 
Eeatures  and  properties  in  the  ocean  interact  over  these 
scales  but  significant  interactions  occur  predominantly 
over  certain  ranges  of  scales,  which  are  usually  referred 
to  as  scale  windows.  Eor  example,  the  internal  weather  of 
the  sea,  the  so-called  oceanic  mesoscale,  mainly  consists 
of  phenomena  occurring  over  a  day  to  months  and  over 
kilometers  to  hundreds  of  kilometers.  This  is  one  of 
the  most  energetic  scale  windows  in  the  ocean  and  the 
present  MTC  study  focuses  on  this  window  of  processes. 

A  comprehensive  prediction  should  include  the  relia¬ 
bility  of  estimated  quantities.  This  allows  an  adequate 
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use  of  these  estimates  in  a  scientific  or  operational 
application.  In  a  prediction  with  a  model  integrating 
either  in  time  and  in  space,  errors  in  the  initial  data 
(initial  conditions),  boundary  conditions  and  models 
themselves  impact  accuracy.  Predicted  uncertainties  then 
contain  the  integrated  effects  of  the  initial  error  and  of 
the  errors  introduced  continuously  during  model  inte¬ 
gration.  Mathematically,  uncertainty  can  be  defined  here 
by  the  probability  density  function  (PDF)  of  the  error 
in  the  estimate.  Since  ocean  fields  are  four-dimensional, 
uncertainty  representations  are  here  also  fields,  with 
structures  in  time  and  space. 

Realistic  simulations  of  four-dimensional  ocean  fields 
are  carried  out  over  broad  numerical  domains,  e.g.  0(10- 
1000)  km  for  0(10-1000)  days.  The  number  of  grid  points 
and  thus  of  discretized  state  variables  are  very  large, 
usually  of  0(10^  —  10^).  On  the  other  hand,  ocean  data  are 
limited  in  temporal  and  spatial  coverage.  Commonly,  the 
number  of  data  points  for  an  at-sea  sampling  campaign 
is  of  0(10^  — 10^).  For  substantial  scientific  advances  and 
to  reduce  uncertainties,  the  sources  of  information,  the 
various  data  and  dynamical  models,  are  combined  by 
data  assimilation  [4].  This  combination  is  challenging 
and  expensive  to  carry  out,  but  optimal  in  the  sense  that 
each  type  of  information  is  weighted  in  accord  with  its 
uncertainty. 

2.1  Real  Time  Assimilation 

An  important  clarification  needs  to  be  made  regarding 
the  different  times  involved  in  ocean  forecasting:  the 
observation  time,  forecaster  time  and  simulation  time 
(Fig.  1).  New  observations  are  made  available  in  batches 
(Fig.  1,  first  row)  during  periods  from  the  start  of 
the  experiment  (To)  up  to  the  final  time  (T/).  During  the 
experiment,  for  each  prediction  k  (Fig.  1,  zoom  in  middle 
row),  the  forecaster  repeats  a  set  of  tasks  (from  Tq  to  ). 
These  tasks  include  the  processing  of  the  currently  avail¬ 
able  data  and  model  (from  Tq  to  Tq),  the  computation  of 
r  +  1  data-driven  forecast  simulations  (from  tg  to 
and  the  study,  selection  and  web-distribution  of  the  best 
forecasts  (from  to  r^).  Within  these  forecast  compu¬ 
tations,  a  specific  forecast  simulation  i  (Fig.  1),  zoom  in 
bottom  row)  is  executed  during  tg  to  and  associated 
to  a  "simulation  time".  For  example,  the  ith  simulation 
starts  with  the  assimilation  and  adaptive  modeling  based 
on  observations  Tg,  then  integrates  the  dynamic  model 
with  data  assimilation  and  adaptive  modeling  based  on 
observations  Ti,  etc.,  up  to  the  last  observation  period 
T/c  which  corresponds  to  the  nowcast.  After  T^,  there 
are  no  new  data  available  and  the  simulation  enters  the 
forecasting  period  proper,  up  to  the  last  prediction  time 
T/c+n- 

2.2  Ocean  Acoustics 

As  one  of  the  major  application  of  underwater  acoustics, 
sonar  performance  prediction  requires  the  modeling  of 
the  acoustic  field  evolution.  The  parameters  include  the 
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Fig.  1.  Forecasting  timelines.  Top  row:  “Observation”  or 
“ocean”  time  T  during  which  measurements  are  made  and 
the  real  phenomena  occur.  Middle  row:  “Forecaster”  time 
during  which  the  k\h  forecasting  procedure  and  tasks 
are  started  and  finished.  Bottom  row:  “ith  simulation”  time 
T  which  covers  portions  of  the  real  “ocean”  time  for  each 
simulation.  Multiple  simulations  are  usually  distributed  on 
several  computers,  including  ensembles  of  forecasts  for 
uncertainty  predictions  (ESSE). 


four-dimensional  ocean  and  seabed  fields.  They  are  com¬ 
plex  to  predict  and  can  have  significant  uncertainties. 
Methods  and  systems  that  forecasts  the  ocean,  the  seabed 
and  the  acoustics  in  an  integrated  fashion  have  only 
been  developed  and  utilized  recently.  Our  approach  is 
based  on  coupling  data-assimilative  environmental  and 
acoustic  propagation  models  with  ensemble  simulations, 
as  developed  by  [5],  [6]. 

Having  an  estimate  of  the  ocean  temperature  and 
salinity  fields  (along  with  their  respective  uncertainties) 
provides  the  required  background  information  for  cal¬ 
culating  acoustic  fields  and  their  uncertainties.  Sound- 
propagation  studies  often  focus  on  vertical  sections. 
ESSE  ocean  physics  uncertainties  are  transferred  to 
acoustical  uncertainties  along  such  a  section.  Time  is 
fixed  and  an  acoustic  broadband  transmission  loss  (TL) 
field  is  computed  for  each  ocean  realization.  A  sound 
source  of  specific  frequency,  location  and  depth  is  cho¬ 
sen.  The  coupled  physical-acoustical  covariance  P  for 
the  section  is  computed  and  non-dimensionalized.  Its 
dominant  eigenvectors  (uncertainty  modes)  can  be  used 
for  coupled  physical-acoustical  assimilation  of  hydro- 
graphic  and  TL  data.  ESSE  has  also  been  extended  to 
acoustic  data  assimilation.  With  enough  compute  power 
one  can  compute  the  whole  "acoustic  climate"  in  a  three- 
dimensional  region,  providing  TL  for  any  source  and 
receiver  locations  in  the  region  as  a  function  of  time  and 
frequency,  by  running  multiple  independent  tasks  for 
different  sources /frequencies /slices  at  different  times. 
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3  Error  Subspace  Statistical  Estima¬ 
tion 

3.1  Formalism 

Using  continuous-discrete  Bayesian  estimation  [7]  and 
the  notation  of  [8],  the  spatially  discretized  version  of 
the  deterministic-stochastic  ocean  model  and  parameter 
equations  are  combined  into  a  single  equation  for  the 
augmented  state  vector  x,  of  large  but  finite  dimensions. 
Observations  are  taken  at  discrete  instants  tk  >  to  and 
are  concatenated  into  a  data  vector  y^.  The  dynamics, 
observations  and  DA  criterion  are  then, 

dx=M(x,  t)  +  drj  (1) 

yl=n{xk,tk)  +  ek  (2) 

min  J(x,  y^;  dij ,  ,  Q(i) ,  Rfc)  (3) 

X 

where  M  and  H  are  the  model  and  measurement  model 
operator,  respectively,  J  the  objective  function,  and  drj 
Wiener  processes  (Brownian  motion),  i.e.  rj  V(0,Q(i)) 
with 

E{dr](t)dr]^(t)}  =  Q{t)  dt.  Note  that  the  deterministic 
ocean  dynamics  and  parameter  equations  are  actually 
forced  by  noise  processes  correlated  in  time  and  space. 
State  augmentation  [7],  [9],  [10]  is  used  to  re-write 
equations  in  the  form  of  Eq.  1  which  are  forced  by 
intermediary  processes  dr]  white  in  time  and  space. 
Measurement  model  uncertainties  ek  are  assumed  white 
Gaussian  sequences,  ek  ^  Ar(0,R/c).  The  initial  condi¬ 
tions  have  a  prior  PDF,  p{x{to)),  i.e.  x(to)  =  xq  -1-  n(0) 
with  n(0)  random. 

Error  Subspace  Statistical  Estimation  (ESSE,  [2],  [11], 
[12])  intends  to  estimate  and  predict  the  largest  un¬ 
certainties,  and  combine  models  and  data  accordingly. 
When  the  DA  criterion  (Eq.  3)  guides  the  definition  of  the 
largest  uncertainties  or  "error  subspace",  the  suboptimal 
truncation  of  errors  in  the  full  space  is  optimal. 

ESSE  proceeds  to  generate  an  ensemble  of  model 
integrations  whose  initial  conditions  are  perturbed  with 
randomly  weighted  combinations  of  the  error  modes. 
A  central  (unperturbed)  forecast  is  also  generated.  The 
matrix  of  differences  between  each  perturbed  model 
realization  in  the  ensemble  and  the  central  forecast  is 
then  generated  and  an  estimate  of  the  conditional  mean 
is  produced.  A  singular  value  decomposition  (SVD)  of 
the  resulting  normalized  matrix  provides  us  with  the 
dominant  error  modes  (based  on  a  comparison  of  the 
singular  values).  A  convergence  criterion  compares  error 
subspaces  of  different  sizes.  Hence  the  dimensions  of  the 
ensemble  and  error  subspace  vary  in  time  in  accord  with 
data  and  dynamics.  The  whole  procedure  can  be  seen  in 
Figure  2. 

The  main  component  of  the  ESSE  scheme  that  is  used 
here  is  the  uncertainty  prediction.  An  initial  condition 
for  the  dominant  errors  is  assumed  computed  and  avail¬ 
able,  using  schemes  given  in  [13],  [14].  At  tk,  x/c(+)  is 
pertubed  (Eq.  6)  using  a  combination  of  error  modes 
E/c(+)  with  random  coefficients  7r;^(+)-  These  coefficients 
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Fig.  2.  The  ESSE  algorithm 


are  weighted  by  i7/e(+)  and  constrained  by  dynamics 
[11].  The  truncated  tail  of  the  error  spectrum  is  modeled 
by  random  white  noise  n;^.  For  the  evolution  to  a 
central  forecast  (Eq.  4)  and  an  ensemble  of  j  =  l,...,^' 
stochastic  ocean  model  integrations  is  run  (Eq.  7),  start¬ 
ing  from  the  perturbed  states  x;^(+).  The  forcings  dr]{t)  are 
defined  in  [2].  The  ES  forecast  (Eq.  9)  is  computed  from 
the  ensemble.  The  matrix  M/^+A-)  =  [xi^^i-)-Xk+i{-)] 
of  differences  between  q  realizations  and  an  estimate 
of  the  conditional  mean,  e.g.  x|^^(-)  in  Eq.  5,  is  then 
computed.  It  is  normalized  and  decomposed  (Eq.  9) 
into  iTfc+i(-)  =  i  17^+1  (-)  and  Efc+i(-)  of  rank  p  <  q 
by  singular  value  decomposition  (the  operator  SVDp(-) 
selects  the  rank-p  SVD).  The  ensemble  size  is  limited  by 
a  convergence  criterion  (Eq.  10).  The  coefficient  p  used 
here  measures  the  similarity  between  two  subspaces 
of  different  sizes  (  [15],  [16]).  A  "previous"  estimate 
(E,77)  of  rank  p  and  "new"  estimate  (E,77)  of  rank 
p  >  p  are  compared,  using  singular  values  to  weight 
singular  vectors.  The  scalar  limit  a  is  chosen  by  the 
user  (1  —  e  <  a  <  1).  cri(-)  selects  the  singular  value 
nurnber  i  and  k  =  min  (p,p).  When  p  is  close  to  one, 
(E,77)  is  the  error  forecast  for  4+i:  77/c+i(-),  E/e+i(-). 
The  dimensions  of  the  ensemble  (g)  and  ES  (p)  hence 
vary  with  time,  in  accord  with  data  and  dynamics. 


Central  fcst  :  (4) 

I  dx=M{%t)dt  , 

with  x/c  =  x/c(+) 

Ens.mean  :  (5) 

ESIn.  Cond.  ;  (6) 

x;^(+)=Xfc(+)  +  Efc(+)7r^(+)  +  n-^  , 

j=l,...,q. 

Ens.  Fcst :  (7) 

^k+i  (“)  I  ,t)dt  +  drj 

with  xi=xi(+) 

ESFcst ;  (8) 


Mfc+i(-)=[x;^_^^(-)  -  Xk+iscriptstyle{-)] 
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),Efe+i(-)  SVDp(]V[fe+i(-)) 

(9) 

=Ek+i(-)Sk+ii-)yl+ii-: 

Conv.Crit.  : 

(10) 

Eti  cTi(775E^E77^) 

>  a 

Acoustic  predictions  are  generated  using  acoustic 
propagation  models  and  newly  developed  parallel  soft¬ 
ware.  With  this  new  parallel  acoustic  software,  we  com¬ 
pute  the  whole  "acoustic  climate"  in  a  three-dimensional 
region,  providing  transmission  loss  (TL)  for  any  source 
and  receiver  locations  in  the  region  as  a  function  of  time 
and  frequency. 

4  ESSE  WORKFLOW 

The  ESSE  calculations  require  the  calculation  of  a  very 
large  ensemble  of  ocean  forecasts.  This  imposes  sig¬ 
nificant  demands  on  computational  power  and  stor¬ 
age.  ESSE  ensembles,  however,  differ  from  typical  high 
throughput  applications  such  as  parameter  scans  in  more 
than  one  way: 

1)  there  is  a  hard  deadline  associated  with  the  exe¬ 
cution  of  the  ensemble,  as  a  forecast  needs  to  be 
timely; 

2)  the  size  of  the  ensemble  is  dynamically  adjusted  ac¬ 
cording  to  the  convergence  of  the  ESSE  procedure; 

3)  individual  ensemble  members  are  not  significant 
(and  their  results  can  be  ignored  if  unavailable)  - 
what  is  important  is  the  statistical  coverage  of  the 
ensemble; 

4)  the  full  resulting  dataset  of  the  ensemble  member 
forecast  is  required,  not  just  a  small  set  of  numbers; 

5)  individual  forecasts  within  an  ensemble,  especially 
in  the  case  of  interdisciplinary  interactions  and 
nested  meshes,  can  be  parallel  programs  them¬ 
selves. 

Point  (1)  above  hints  towards  the  use  of  the  any 
Advanced  Reservation  capabilities  available;  point  (2) 
means  that  the  actual  compute  and  data  requirements 
for  the  forecast  are  not  known  beforehand  and  change 
dynamically;  point  (3)  suggests  that  failures  (due  to 
software  or  hardware  problems)  are  not  catastrophic  and 
can  be  tolerated  -  moreover  runs  that  have  not  finished 
(or  even  started)  by  the  forecast  deadline  can  be  safely 
ignored  provided  they  do  not  collectively  represent  a 
systematic  hole  in  the  statistical  coverage.  Point  (4) 
means  that  relatively  high  data  storage  and  network 
bandwidth  constraints  will  be  placed  on  the  underlying 
infrastructure  and  point  (5)  means  that  the  compute 
requirements  will  not  be  insignificant  either. 

In  the  case  of  the  ESSE  approach  to  Data  Assimilation, 
a  central  process  acts  as  a  job  shepherd  for  the  ensemble, 
as  shown  in  Eig  3:  A  loop  of  N  ensemble  members  is  first 
calculated,  each  member  consisting  of  a  perturbation  of 
the  initial  conditions /parameters  and  a  forecast.  After  all 


members  are  calculated,  the  difference  of  the  resulting 
forecast  from  a  central  forecast  is  calculated  in  a  loop, 
creating  a  large  file  containing  the  uncertainty  covari¬ 
ance  matrix.  A  Singular  Value  Decomposition  (SVD) 
of  this  matrix  ensues  followed  by  a  convergence  test 
with  the  result  of  the  previous  SVD.  If  convergence  is 
not  achieved,  the  process  loops  back  to  increase  N  to 
A2,  up  to  some  maximal  value  N^ax  or  until  the  time 
Tmax  available  for  the  forecast  expires.  The  process  then 
restarts  for  the  ensemble  members  A  +  1  to  N2.  This 
approach  suffers  from  several  bottlenecks: 

1)  The  perturb /forecast  loop  needs  to  finish  for  the 
diff  loop  to  start  (or  the  two  loops  can  be  fused 
(merged).  Either  way  there  is  no  exposed  paral¬ 
lelism. 

2)  The  diff  loop  has  a  serial  bottleneck  (the  same  file 
is  written  to).  Depending  on  the  variant  of  the 
perturbation  type  employed,  it  may  also  expect  to 
add  the  perturbations  to  the  uncertainty  covariance 
matrix  in  the  order  they  were  generated. 

3)  The  SVD /convergence  test  has  to  wait  for  the  diff 
loop  to  finish. 

4)  The  SVD  and  the  convergence  test  are  large  calcula¬ 
tions  requiring  a  lot  of  memory  and  time,  especially 
for  large  N. 

4.1  Parallelized  ESSE 

We  considered  a  natural  transformation  of  the  ESSE 
process  to  address  these  bottlenecks  and  increase  the 
amount  of  exploitable  parallelism,  transforming  the 
problem  into  one  amenable  to  MTC  techniques  -  see  also 
Eig  4.  Specifically  we  dealt  with  bottleneck  1  above  by 
replacing  the  concept  of  the  loop  with  that  of  a  pool  of 
ensemble  calculations,  of  size  M  >  N.  These  calcula¬ 
tions  can  be  done  concurrently  on  different  machines,  as 
there  is  no  actual  serial  dependence  in  the  forecasting 
loop.  They  would  in  effect  be  the  MTC  element  of  the 
forecasting  procedure.  We  then  decouple  the  diff  loop  by 
having  it  run  continuously,  adding  new  elements  to  the 
uncertainty  covariance  matrix,  as  they  become  available 
from  the  forecast  ensemble  calculations.  Eurthermore,  we 
relax  our  requirement  that  elements  of  the  covariance 
matrix  are  in  the  order  of  the  perturbation  number 
(bottleneck  2)  and  instead  keep  track  of  which  pertur¬ 
bation  is  added  every  time  for  bookkeeping  purposes. 
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Unfortunately  we  cannot  easily  do  away  with  the  single 
file  bottleneck  on  the  diff  loop  and  that  forces  us  to 
limit  the  diff  calculation  to  a  single  machine  with  access 
to  lots  of  disk  space  as  the  covariance  matrix  tends  to 
be  very  large  (0((N  G  V)2)  where  G  is  the  number  of 
3D  grid  points  and  V  the  number  of  physical  fields  and 
biochemical /physical  tracer  variables). 
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Fig.  4.  The  parallel  ESSE  implementation 


The  SVD  calculation  and  the  convergence  test  are  also 
decoupled  from  the  diff  loop  by  running  continuously  on 
their  own,  using  the  latest  result  available  from  the  diff 
loop.  To  fully  decouple  the  loops  without  introducing 
a  race  condition  on  the  covariance  matrix  file  between 
its  reading  for  the  SVD  and  its  writing  by  diff,  we 
employ  three  files,  a  safe  one  for  SVD  to  use  and  a 
live  alternating  pair  for  diff  to  write  to,  with  the  safe 
one  being  updated  by  the  the  appropriate  member  of 
the  pair.  The  SVD  calculation  and  the  convergence  test 
proceed  on  its  own  with  the  requirement  of  fast  I/O 
access  to  the  safe  file  and  a  machine  with  large  memory 
and  many  processors  for  the  parallel  SVD  calculation  on 
a  dense  matrix  (for  the  time  being  we  are  employing 
shared-memory  parallel  LAPACK  calls  though  the  use 
of  SCALAPACK  for  distributed  memory  clusters  may 
become  necessary  in  the  future  if  our  ensembles  get  too 
large). 

If  the  convergence  test  succeeds,  the  remaining  en¬ 
semble  members  (queued  for  execution  or  running)  are 
canceled,  and  depending  on  the  time  constraints  (for 
forecast  timeliness)  and  an  associated  policy,  either  the 
ensemble  calculation  concludes  immediately  or  the  re¬ 
maining  ensemble  results  already  calculated  are  diffed. 


another  SVD  calculation  is  performed  and  all  available 
results  are  used.  In  theory  one  could  also  spare  any 
ensemble  calculations  close  to  finishing  (according  to 
performance  estimates  for  the  machines  they  are  execut¬ 
ing  on  and  accumulated  runtime),  to  further  minimize 
the  wasted  cycles  at  the  expense  of  further  delays. 

If  the  convergence  test  fails  for  a  number  of  ensemble 
members  sufficiently  close  to  M  <  N^axr  the  ensemble 
pool  can  be  enlarged  (in  stages)  up  to  Nmax  (or  even 
slightly  more)  in  order  to  ensure  convergence  and  at  the 
same  time  make  sure  that  there  is  no  point  during  this 
process  where  the  pipeline  of  results  drains  and  the  SVD 
calculation  has  to  wait  (aside  from  the  startup  wait). 

4.2  Implementation  specifics 

The  ESSE  workflow  is  implemented  as  a  shell  script  in 
variants  targeting  either  Sun  Grid  Engine  (SGE)  [17]  or 
Condor  [18].  If  the  shell  script  catches  the  kill  signal  it 
proceeds  to  cancel  all  pending  jobs  and  do  some  cleanup. 
This  master  script  that  runs  on  a  central  machine  on 
the  home  cluster  launches  singleton  jobs  that  implement 
the  perturb /forecast  ensemble  calculations.  The  differ, 
SVD  and  convergence  check  calculations  proceed  semi- 
independently,  either  on  the  same  machine  as  the  master 
script  or  on  some  other  machine  with  access  to  the  same 
filesystem  and  lots  of  memory.  They  wait  to  ascertain 
that  a  multiple  of  a  set  number  of  realizations  has 
finished  and  then  they  run.  We  allow  for  variants  where 
the  perturb /forecast  ensemble  is  split  in  two,  first  all 
the  perturbations  are  generated  and  then  the  forecasts 
are  run.  This  makes  sense  only  in  case  that  there  are 
very  few  machines  with  good  network  connections  to 
the  storage  hosting  the  large  files  that  perturb  needs  to 
read.  In  that  case  it  makes  sense  to  restrict  the  execution 
of  pert  to  those  machines  only  and  split  the  ensemble 
workflow.  Dependencies  are  tracked  using  separate  (per 
perturbation  index)  files  containing  the  error  codes  of 
the  singleton  scripts  (which  are  set  on  purpose  to  sig¬ 
nify  success  or  failure).  These  files  reside  in  directories 
accessible  directly  or  indirectly  from  all  execution  hosts 
so  that  state  information  can  be  readily  shared.  Moreover 
the  perturbation  index  number  is  passed  on  to  each 
singleton  either  by  cleverly  altering  the  name  of  each 
job  submission  to  include  it  or  by  stripping  it  off  the 
task  array.  The  latter  approach  is  more  desirable  (as  it 
places  less  strain  on  the  job  scheduler)  but  if  the  ESSE 
execution  gets  stopped  ,  it  can  only  be  restarted  without 
rerunning  all  jobs  by  switching  to  a  one-job  submission 
per  perturbation  index  strategy. 

5  ESSE  AS  AN  MTC  APPLICATION  IN 
PRACTICE 

5.1  Special  ESSE  needs 

ESSE  and  other  similar  ensemble-based  ocean  forecast¬ 
ing  methodologies  are  used  a  several  times  a  year  in  a 
real-time  setting  during  live  ocean  experiments  lasting 


weeks  to  months.  In  the  past,  any  calculations  that  was 
more  involved  than  a  simple  serial  forecast  (possibly  em¬ 
ploying  objective  analysis  based  data  assimilation  which 
could  still  be  handled  by  a  powerful  on-board  worksta¬ 
tion)  had  to  be  performed  back  on  land.  Remote  com¬ 
puter  clusters  at  participating  academic /commercial  or 
military  institutions  were  used,  connected  via  slow  links 
to  the  ship-borne  measurement  apparatus.  Advances  in 
computer  system  and  networking  technology  have  now 
resulted  in  the  availability  of  a  ship-borne  computing 
infrastructure  (of  a  rack  or  even  deskside  form  factor)  to 
handle  pretty  large  basic  ensemble  calculations.  At  the 
same  time  the  constant  drive  for  higher  resolution,  better 
(and  more  usually  than  not  -  more  complex)  models 
and  comprehensive  error  subspace  representations  have 
resulted  in  considerably  larger  increases  of  the  compu¬ 
tational  demands.  In  practice  this  means  that  for  the 
"real-time"  requirements  to  be  satisfied,  the  use  of  land- 
based  clusters  is  still  required  for  the  more  involved 
ESSE  analyses. 

This  suggests  that  use  of  a  dedicated  home  cluster 
resource  is  definitely  worthwhile  as  such  a  system  is 
under  the  complete  control  of  the  Pis  and  can  be  devoted 
entirely  to  the  needs  of  the  real-time  experiment.  Such 
systems  are  also  necessary  because  a  lot  of  other  incu¬ 
bating  computations  are  required,  either  to  prepare  such 
experiments  and  develop  new  methods  and  software  for 
it,  or  to  carry  out  other  independent  research  work. 

Importantly,  the  local  home  cluster  resources  should 
be  augmented  by  remote  machines  that  are  not  under 
the  direct  control  of  the  user.  Such  resources  can  be 
provided  in  the  form  of  batch-controlled  allocations  on 
(in  the  case  of  the  USA  and  depending  on  the  sponsor¬ 
ing  agency)  NSE,  DoE,  DoD,  NOAA  or  NASA  shared 
compute  resources  or  more  generally  via  use  of  cloud 
computing  based  virtual  clusters,  such  as  Amazon's  EC2. 
Such  systems  can  be  utilized  on  demand,  as  a  function 
of  the  real-time  needs  over  limited  periods. 

5.2  I/O  analysis 

Beyond  the  compute  side  needs  of  the  major  components 
of  ESSE  we  have  to  consider  the  I/O  stresses  that  they 
put  on  our  storage  infrastructure.  Both  pert  and  pemodel 
(as  well  as  the  other  executables)  employ  NetCDE  [19] 
for  reasons  of  portability  -  this  however  tends  to  make 
I/O  requests  rather  opaque  as  they  are  hidden  under 
the  hood  of  the  NetCDE  library.  We  chose  to  use  a 
tool  that  looks  at  I/O  performance  without  requiring 
recompilation  or  other  type  of  instrumentation  of  exe¬ 
cutables:  strace_analyzer  [20]  employs  low  level  hooks 
in  the  operating  system  to  capture  read,  write  and  other 
I/O  requests.  In  the  parts  that  follow,  KB,  MB  etc.  refer 
to  powers  of  10  instead  of  powers  of  2,  e.g.  1KB=1000 
bytes  etc. 

5.2. 1  I/O  needs  of  the  perturbation  generators 
When  pert  is  run  on  the  local  filesystem  I/O  time  is 
about  27%  of  the  total  time.  As  can  be  seen  in  Table  1, 


TABLE  1 

pert  write/read  sizes 


I/O  size  range 
0-lKB 

#  of  requests 
11 

type 

read 

1-8KB 

3 

read 

8-32KB 

53603 

read 

0-lKB 

1 

write 

1-8KB 

1 

write 

8-32KB 

544 

write 

there  are  a  great  many  I/O  calls  of  a  small  size. 

The  total  number  of  bytes  written  was  4.5MB  in  546 
total  calls.  The  average  (mean)  bytes  per  write  call  were 
8,197  bytes,  with  a  standard  deviation  of  649  bytes.  The 
median  bytes  per  call  were  2  pages  (8192  bytes). 

The  total  number  of  bytes  read  was  435.5MB,  in  53,072 
total  calls.  The  average  (mean)  bytes  per  read  call  was 
8,206  bytes,  with  a  standard  deviation  of  373  bytes.  The 
median  bytes  per  call  were  as  before  2  pages. 

A  total  of  4,363,806  bytes  was  read  from  the  cen¬ 
tral  forecast  file,  426,639,520  was  read  from  the  error 
subspace  matrix  and  4,494,987  bytes  were  read  with 
4,472,832  written  to  the  perturbed  ICs. 


Fig.  5.  Read/write  I/O  request  sizes  for  pert  as  a  function 
of  time 

Looking  at  a  breakdown  in  time  of  the  I/O  requests 
in  Eigure  5  we  see  that  there  is  a  continuous  set  of  reads 
throughout  the  run  in  two  sizes  (8  &  16KB);  a  write 
happens  at  the  very  beginning  and  the  rest  at  the  end. 

Looking  at  a  breakdown  in  throughput  of  the  I/O 
requests  in  Eigure  6  we  see  that  there  is  a  wide  spread  of 
read  bandwidth  as  seen  by  the  application  (ie.  including 
filesystem  cache  effects)  throughout  the  run.  Similarly 
for  the  writes  at  the  end  of  the  run,  one  sees  a  range  of 
performance  values. 

Overall  we  can  state  that  pert  is  partly  I/O  bound  (but 
finishes  quickly  on  local  disk)  -  the  main  I/O  involves 
reading  part  of  the  error  subspace  matrix.  The  individual 
I/O  calls  by  the  NetCDE  library  are  both  numerous 
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Fig.  6.  Read/write  I/O  throughput  (in  MB/s)  for  pert  as  a 
function  of  time  -  filesystem  cache  is  enabled 


and  3.9MB  are  read  from  the  pressure  bias  and  shapiro 
filter  input  files  respectively. 


TABLE  2 

pemodel  write/read  sizes 


Fig.  7.  Read/write  I/O  request  sizes  for  pemodel  as  a 
function  of  time 


I/O  size  range 

#  of  requests 

type 

0-lKB 

3290 

read 

1-8KB 

4227 

read 

8-32KB 

137133 

read 

0-lKB 

2019 

write 

1-8KB 

148 

write 

8-32KB 

59532 

write 

Looking  at  a  breakdown  in  time  of  the  I/O  requests 
in  Figure  7  we  see  that  there  is  a  flurry  of  activity  at  the 
very  beginning  and  end  of  the  run;  reads  continue  in 
two  main  sizes  for  the  first  3rd  of  the  run  or  so.  Writes 
are  very  small  and  spread  throughout. 


and  small:  The  calls  at  a  granularity  that  is  smaller 
than  a  page  are  insignificant  but  such  a  pattern  is  not 
well  suited  for  large  parallel  filesystems  (such  as  PVFS2) 
which  are  tuned  for  large  streaming  stores.  Moreover 
the  internal  structure  of  the  NetCDF  files  appears  to 
necessitate  a  lot  of  reading  and  then  writing  of  the  output 
file  which  in  many  cases  can  cause  trouble  with  NFS. 

5.2.2  I/O  needs  of  the  ocean  model 

When  pemodel  is  run  on  the  local  filesystem  I/O  time 
is  only  about  0.24%  of  the  total  time  (ie.  the  code,  as 
currently  setup  is  not  I/O  limited).  As  can  be  seen  in 
Table  2,  there  are  a  great  many  I/O  calls  of  a  small  size. 

The  total  number  of  bytes  written  was  492MB  in 
61,699  total  calls.  The  average  (mean)  bytes  per  write 
call  were  7,976  bytes,  with  a  standard  deviation  of  1,584 
bytes.  The  median  bytes  per  call  were  again  2  pages 
(8,192  bytes) 

The  total  number  of  bytes  read  was  679MB,  in  82,952 
total  calls.  The  average  (mean)  bytes  per  read  call  was 
8,187  bytes,  with  a  standard  deviation  of  2,121  bytes.  The 
median  bytes  per  call  were  as  before  2  pages. 

A  total  of  173,283,024  bytes  were  read  from  the  forcing 
file,  8,541,726  were  read  from  the  IC  and  480,431,123 
were  read  from,  with  490,541,996  being  written  to  the 
main  output  file  -  another  468,564/721,57  bytes  are  read 
from/written  to  a  secondary  output  file,  while  12.4M5 


Fig.  8.  Read/write  throughput  (in  MB/s)  for  pemodel  as  a 
function  of  time  -  filesystem  cache  is  enabled 

Looking  at  a  breakdown  in  throughput  of  the  I/O 
requests  in  Figure  8  we  see  that  there  is  a  wide  spread 
of  bandwidth  as  seen  by  the  application  (ie.  including 
filesystem  cache  effects)  throughout  the  run.  A  range  of 
low  to  high  perfromance  is  seen  at  the  beginning  and 
the  end;  the  small  writes  in  the  rest  of  the  run  cannot 
achieve  high  performance. 

Overall  we  can  state  that  pemodel  is  not  I/O  bound 
when  working  out  of  local  disk  -  most  of  the  I/O 


involves  reading  the  forcing  and  reading /writing  the 
output  files.  As  before  the  nature  of  NetCDF  necessitates 
a  lot  of  rereading  of  the  output  file. 

5.3  Performance  implications  over  NFS 

If  we  are  to  look  at  the  performance  of  pert  with 
input/output  files  stored  over  NFS  (Gigabit  Ethernet 
high  performance  network  used  by  the  Steele  Cluster 
at  Purdue  University)  we  discover  that  the  total  runtime 
can  increase  by  a  very  large  amount  -  from  a  few  seconds 
to  2.5  minutes  or  more  (274+  secs  in  Figure  9.  In  fact  in 
the  past,  over  more  constrained  LANs  (100Mbit  mix  of 
switches  and  hubs)  we  have  seen  times  as  high  as  700+ 
seconds,  making  pert  very  expensive. 
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Fig.  9.  Read/write  throughput  (in  MB/s)  for  pert  over  NFS 
as  a  function  of  time  -  filesystem  cache  is  enabled 

The  performance  drop  may  not  show  in  the  percentage 
of  time  spent  doing  I/O,  indeed  that  may  decrease,  as 
the  extra  time  appears  simply  as  wallclock  time  -  the 
process  idles  as  the  kernel  waits  on  NFS  activity.  Thus 
the  individual  bandwidth  figures  in  Figure  9  appear  to 
cover  a  decent  range  of  values  -  it  is  the  total  wallclock 
time  that  suffers.  This  is  a  problem  even  with  a  single 
pert  reading/ writing  over  NFS  -  multiple  ones  (as  would 
be  the  case  in  a  production  ensemble)  can  only  make 
the  problem  worse.  This  suggests  that  use  of  diskless 
clusters  is  not  well  suited  to  this  application. 

5.4  Local  cluster  description 

Our  local  cluster  is  composed  of  114  dual  socket  Opteron 
250  (2.4GHz)  nodes  (1  with  16G  RAM,  2  with  8GB  and 
the  rest  with  4GB),  3  dual  socket  Opteron  285  (dual  core 
2.6GHz)  nodes,  all  with  4GB  RAM  (replacement  nodes), 
and  a  dual  socket  Opteron  2380  (Shanghai  generation, 
quad  core  2.5GHz)  head  node  with  24GB  RAM.  The  file- 
server  serves  over  18TB  of  shared  disk  over  NFS,  using 
a  10Gbit/ s  connection  to  a  200Gbit/ s  switch  backbone. 
All  nodes  have  a  Gigabit  Ethernet  connection  to  switches 


arranged  in  a  star  formation,  feeding  into  the  central 
switch.  The  cluster  has  both  SGE  and  Condor  installed 
and  active  (at  the  same  time).  Condor  is  setup  to  consider 
nodes  used  by  SGE  as  claimed  by  their  "owner"  so  the 
two  systems  can  coexist  (with  Condor  giving  precedence 
to  SGE).  All  users  tend  to  use  only  one  of  the  two 
systems  at  the  time. 

5.4.1  Timings 

Eor  the  timings  discussed  below  about  210  of  the  240 
cores  were  available  -  the  rest  were  in  use  by  other 
users.  We  tested  two  scenarios:  one  that  uses  NES  for 
the  large  input  files  and  another  that  prestages  (to  every 
local  disk)  all  input  files  so  that  all  input  is  local.  We  did 
not  test  the  case  where  both  input  and  output  files  live 
on  the  NES  server  for  the  duration  of  the  execution  of  the 
singletons  as  it  places  too  much  stress  on  the  NES  server 
and  is  disruptive  to  other  users.  Therefore  in  all  cases  the 
useful  output  files  are  copied  back  to  the  NES  server  at 
the  end  of  their  job.  In  all  cases  the  differencing,  SVD 
and  convergence  check  calculations  were  happening  on 
the  master  node. 

This  I/O  optimization  made  more  of 
a  difference  for  the  perturbation  part 
of  the  algorith  where  CPU  utilization 
jumped  from  20%  to  100%.  The  initial  conditions 
generated  thus  and  used  for  the  ensemble  model  runs 
are  stored  on  the  local  directory  anyway  and  therefore 
this  (more  expensive)  part  of  the  ESSE  procedure  does 
not  as  much  of  a  performance  boost.  600  ensemble 
members  pass  through  the  ESSE  workflow  in  llmins 
in  the  all  local  I/O  case  and  in  S6mins  in  the  mixed 
locality  case.  As  all  nodes  were  equally  close  to  the 
fileserver  we  deed  not  deem  it  necessary  to  test  the 
ESSE  variant  where  the  perturbation  calculation  is  done 
in  a  separate  job  submission  from  that  of  the  PE  model. 
Eor  both  SGE  and  Condor  we  used  job  arrays  to  lessen 
the  load  on  the  scheduler. 

Timings  under  Condor  were  between  10  —  20%  slower. 
Essentially  the  difference  could  be  seen  in  the  time  it 
took  for  the  queuing  system  to  reassign  a  new  job  to 
a  node  that  just  finished  one.  In  the  case  of  SGE  the 
transition  was  immediate  -  Condor  appeared  to  want  to 
wait.  We  tweaked  the  configuration  files  to  diminish  this 
difference  in  throughput  which  is  probably  due  to  the 
effort  put  in  Condor  to  function  as  a  very  successful  cycle 
harvester  and  the  resulting  care  it  takes  not  to  disrupt 
everyday  desktop  usage. 

The  ESSE  calculation  was  followed  by  more  than  6000 
ocean  acoustics  realizations  -  each  of  which  executed  for 
approximately  3  minutes  -  in  this  case  no  job  arrays  were 
used  and  the  system  handled  all  6000+  jobs  without  any 
problem  whatsoever. 

5.5  ESSE  on  the  Grid 

The  task  at  hand  is  to  augment  the  ESSE  ensemble 
size  by  employing  remote  resources  (usually  but  not 
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always  Grid-enabled).  That  could  be  either  a  depart¬ 
mental  cluster  within  the  same  overall  organization,  a 
partner  institution  Grid  or  the  large-scale  national  and 
international  Grid  infrastructures  such  as  the  Teragrid, 
Open  Science  Grid,  EGEE  etc. 

The  disadvantage  of  dealing  with  Grid 
resources  is  that  they  come  with  a  wide 
variety  of  rather  heavyweight  middleware 
(such  as  Globus,gLite,Unicore5/  6,OMll-UK,ARC, 
GRIA)  that  are  not  very  easy  to  install  and  require 
maintenance  over  time.  In  this  manner  they  represent 
an  additional  burden  on  both  the  users  and  the 
administrators. 

5. 5. 1  Scheduling  ensembles 

The  easiest  (while  at  the  same  time  least  flexible)  way  to 
add  Grid  resources  for  the  execution  our  ensembles  was 
remote  submission /cancellation  of  jobs  (using  (gsi)ssh 
+  the  local  job  manager  commands)  either  individually 
or  as  a  job  array.  Essentially  a  small  part  of  the  ESSE 
master  script  dealing  with  job  submission/cancellation 
is  replicated  on  the  remote  resource,  singleton  scripts 
particular  to  the  remote  system  in  question  are  submitted 
and  no  complicated  logic  is  needed  to  make  them  work 
as  they  are  not  generic.  The  directories  that  keep  track  of 
job  submissions /completions  etc.  on  the  home  cluster  are 
either  mounted  on  the  remote  system  using  XUES  [21], 
SSHES  [22]  etc.  or  they  are  updated  using  passwordless 
SCP  connections  (to  avoid  requiring  setting  up  Globus  or 
other  Grid  infrastructure  servers  on  the  home  cluster  end. 
This  approach  gives  no  easy  way  for  the  user  to  monitor 
the  progress  of  one's  jobs  (other  than  to  try  to  monitor 
the  contents  of  the  submission /completion  directories). 
One  needs  to  take  care  to  assign  a  clearly  separated  block 
of  ensemble  members  to  these  external  Grid  execution 
hosts  to  avoid  overlaps. 

A  different  path  is  offered  by  the  wide  availability  of 
the  Condor  software.  The  existing  Condor  implementa¬ 
tion  of  ESSE  needs  to  be  slightly  adjusted  to  allow  for 
use  of  remote  clusters  either  via  flocking,  Condor-C  or 
Condor-Glidein.  Unfortunately  all  of  these  approaches 
entail  modification  of  the  configuration  of  the  home 
Condor  cluster  and  sometimes  even  of  the  remote  cluster 
-  something  we  are  able  to  do  locally  but  in  general 
a  non-privileged  user  cannot  do.  Eurther  issues  (which 
can  be  avoided  with  careful  configuration  choices)  can 
arise  when  other  users'  jobs  (also  submitted  to  the 
local  Condor  queues)  end  up  on  remote  Grid  resources 
they  cannot  be  executed  on.  The  remaining  alternative, 
Condor-G,  on  the  other  hand  is  not  as  capable  of  han¬ 
dling  so  many  jobs  as  we  are  envisioning. 

One  other  possibility  (which  circumvents  these  prob¬ 
lems)  is  the  use  of  Personal  Condor  (in  which  case 
all  local  configuration  files  are  owned  by  the  user), 
connecting  via  Condor-Glidein  to  both  the  local  Condor 
pool  and  the  remote  clusters.  A  related  effort  which  we 
plan  to  investigate  further  is  the  use  of  the  MyCluster 
[23]  software  that  makes  a  collection  of  remote  and  local 


TABLE  3 

pert/pemodel  performance  (time  to  completion  in 
seconds)  on  a  few  Teragrid  platforms 


site 

processor  type 

pert 

pemodel 

ORNL 

Pentium4  3.06MHz 

67.83 

1823.99 

Purdue 

Core2  2.33MHz 

6.25 

1107.40 

local 

Opteron  250  2.4GHz 

6.21 

1531.33 

resources  appear  as  one  large  Condor  or  SGE  controlled 
cluster.  This  way  we  we  are  not  limited  to  Condor  but 
we  can  use  our  SGE-based  setup  instead. 

5.5.2  I/O  issues 

There  are  significant  I/O  issues  that  need  to  be  ad¬ 
dressed  when  considering  the  use  of  remote  resources 
for  ESSE  ensembles.  As  a  minimum  requirement  the 
shared  input  files  can  be  read  remotely  from  Open- 
DAP  servers  at  the  home  institution  (using  the  NetCDP- 
OpenDAP  library)  allowing  the  immediate  opportunistic 
use  of  a  remote  resource  that  is  discovered  to  be  idling. 
The  performance  implications  of  such  an  approach  how¬ 
ever  (hundreds  of  requests  to  a  central  OpenDAP  server 
make  it  a  less  desirable  solution.  Therefore  one  is  more 
likely  to  employ  manual  prestaging  of  the  input  files  - 
use  of  shared  filesystems  over  a  WAN  can  help  speed 
up  such  operations  (e.g.  one  copy  from  home  to  gpfs- 
wan  and  then  a  fast  distribution  from  gpfs-wan  to  local 
fast  disks.  Use  of  data  staging  engines  such  as  Stork 
are  another  possibility,  provided  they  work  with  our 
scheduler. 

When  it  comes  to  the  output  files,  one  has  the  choice 
of  either  a  push  model  (from  the  remote  execution  hosts 
back  to  the  home  cluster  or  a  pull  model  (a  pull-agent  on 
the  home  cluster  fetching  files  from  a  central  repository 
for  each  of  the  remote  clusters).  The  former  method  is 
the  simplest  one  requiring  the  least  book-keeping  -  at 
the  same  time  it  requires  nodes  that  can  talk  to  the 
outside  world  and  the  batch  nature  of  the  runs  results 
in  a  very  large  number  of  concurrent  remote  transfer 
attempts  followed  by  no  network  activity  whatsoever. 
This  can  seriously  slow  down  the  gateway  nodes  of 
the  home  cluster.  The  pull  model  requires  more  work  (a 
separate  agent,  notifications  that  files  have  been  copied 
so  they  can  be  safely  deleted  etc.)  but  can  pace  the  file 
transfers  so  that  they  happen  more  or  less  continuously 
and  perform  much  better.  A  third  alternative  introduces 
a  two-stage  put  strategy  -  with  nodes  storing  their  out¬ 
put  on  a  shared  filesystem  and  an  independent  agent 
transferring  them  over  to  the  home  cluster. 

5. 5. 3  Computational  issues 

An  idea  of  the  speeds  of  Teragrid  hosts  running  pemodel 
and  pert  vs.  the  speeds  seen  on  our  local  home  cluster  is 
shown  in  Table  3. 

As  one  can  see  speeds  vary  appreciably  (and  a  recom¬ 
pilation,  however  inconvenient  it  may  be  -  especially  for 
a  last  minute  change  of  code)  can  be  well  worth  it.  The 
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slow  pert  performance  for  ORNL  appears  to  be  mainly 
related  to  the  PVFS2  filesystem  used  -  the  Purdue  runs 
employ  a  local  fast  filesystem.  In  practice  this  means  that 
the  more  disparate  the  hosts  used  to  augment  the  local 
compute  facilities,  the  more  uneven  the  progress  of  the 
various  remote  clusters  will  be  and  perturbation  900  may 
very  well  finish  well  before  number  700. 

5.5.4  Evaluating  ESSE  on  the  Grid 

There  are  many  advantages  to  using  the  Grid  to  augment 

local  compute  resources  for  ESSE: 

•  There  are  a  great  many  computational  resources 
available  on  the  Grid  allowing  for  .  Teragrid's  Con¬ 
dor  pool  is  claimed  to  be  almost  27,000  cores  but 
at  the  time  of  the  writing  of  this  paper  only  about 
1828  appeared  to  be  available  for  use,  with  around 
100  at  a  time  free  to  run  a  user  job. 

•  Many  Grid-enabled  systems  have  been  designed 
with  massive  I/O  requirements  in  mind,  allowing 
for  fast  access  from  many  nodes  to  a  shared  filesys¬ 
tem. 

•  Similarly  large  shared  Grid-enabled  systems  usually 
have  excellent  connectivity  to  the  fastest  Internet 
backbone  and  allow  for  fast  file  transfers  to  and 
from  the  home  system. 

At  the  same  time  there  are  significant  disadvantages 
to  using  the  Grid: 

1)  Each  remote  resource  is  slightly  to  very  differ¬ 
ent  in  hardware,  software  (O/S,  compilers  and 
libraries)  and  filesystem  configuration.  This  means 
that  the  user  is  faced  not  only  with  having  to 
rebuilt  and  redeploy  the  code  binaries  every  time 
but  also  with  modifying  variables  in  the  singleton 
execution  scripts  to  match  the  particulars  of  the 
filesystem/ operating  system  setup  at  hand. 

2)  Due  to  the  shared  nature  of  resources  on  large 
external  centers  one  cannot  be  sure  that  there  will 
be  enough  nodes  on  a  single  resource  to  reach 
the  capacity  needed.  In  the  absence  of  advance 
reservation  the  jobs  submitted  may  very  well  end 
up  running  on  the  following  day  (or  in  any  case 
outside  the  useful  time  window  for  ocean  forecasts 
to  be  issued).  So  many  different  Grid  resources  at 
the  same  time  would  have  to  be  employed  (with 
the  resulting  increase  in  complexity). 

3)  A  careful  estimate  of  the  duration  of  the  jobs  can 
help  in  case  backfilling  is  employed  on  the  queuing 
system  of  the  Grid  resource  but  even  in  that  case 
commonly  used  limitations  of  active  jobs  (irrespec¬ 
tive  of  total  core  count)  per  user  can  throttle  back 
performance  expectations. 

4)  Moreover  in  many  cases  the  queuing  system  sched¬ 
uler  has  been  tuned  to  prioritize  large  core  count 
parallel  jobs  and  thereby  penalize  massive  task 
parallelism  workloads.  In  that  case  one  needs  to 
refactor  singleton  jobs  to  batches  of  singletons  pack¬ 
aged  as  a  single  job  (with  all  the  extra  trouble  this 
refactoring  can  introduce). 


Advance  reservations  (which  are  not  yet  widely  avail¬ 
able  if  at  all  possible)  will  be  necessary  to  ensure  that 
a  sufficient  number  of  cpu  power  will  be  available. 
Experiments  are  planned  ahead  of  time  to  allow  for  such 
reservations  to  be  made  but  their  daily  time  boundaries 
cannot  be  very  tight. 

Another  issue  with  the  MPP  platforms  available  on  the 
Grid  that  offer  massive  numbers  of  processors  for  high 
throughput/ massive  task  parallelism  type  of  workloads 
is  that  their  I/O  configuration  and  support  for  running 
scripts  can  be  limited.  Case  in  point  are  the  IBM  Blue 
Gene/L  systems  (like  NCAR's  Frost  on  the  Teragrid) 
which  share  one  I/O  node  for  a  number  of  compute 
nodes  and  does  not  offer  a  complete  O/S  environment 
on  the  compute  node  to  support  running  a  script.  Full 
support  for  running  shell  scripts  on  MPP  compute  nodes 
unfortunately  may  go  against  the  general  philosophy  of 
having  them  run  a  minimized  O/S  in  order  to  better 
perform  when  running  closely  coupled  parallel  codes. 

5.6  ESSE  in  the  Cloud 

The  emerging  Cloud  Computing  infrastructure  offers 
us  a  different  avenue  we  can  pursue  to  augment  the 
ESSE  ensemble  size.  Given  our  needs  we  are  interested 
in  the  laaS  (Infrastructure  as  a  Service)  form  of  Cloud 
Computing  services.  In  particular  we  have  experimented 
with  what  is  currently  the  most  easy  to  use  laaS  system, 
Amazon's  EC2. 

5. 6. 1  Scheduling  ensembles 

EC2  offers  a  set  of  tools  that  allow  the  provisioning 
and  booting  of  various  Linux,  Solaris  and  Windows 
Xen  virtual  machine  images  (called  AMIs)  and  allows 
the  remote  user  to  login  to  them  as  an  administrator 
and  control  them  accordingly.  There  is  also  control  over 
which  ports  each  live  instance  has  open  to  the  internal 
EC2  network  as  well  as  the  outside  world.  This  level  of 
complete  control  allows  us  a  wide  variety  of  options  on 
how  to  use  EC2  provisioned  nodes  for  ESSE  calculations: 

•  Creation  of  an  independent  on-demand  cluster,  with 
its  own  master  node  and  queuing  system  and  re¬ 
mote  submission  of  jobs  in  the  same  way  as  for  a 
generic  remote  cluster /Grid  environment. 

•  Addition  of  the  EC2  nodes  to  the  home  cluster  as 
extra  compute  nodes.  This  has  already  been  demon¬ 
strated  for  GridEngine  and  we  have  been  able  to 
replicate  it.  Condor  also  offers  the  ability  to  launch 
jobs  on  Amazon  EC2  nodes  but  the  way  that  they 
are  provisioned  (essentially  as  a  job)  and  controlled 
is  too  restrictive  for  our  needs. 

•  Creation  of  a  personal  (Condor  or  SGE)  private  clus¬ 
ter  using  My  Cluster  mixing  local  and  EC2  resources. 

•  Dynamic  addition  of  EC2  nodes  to  an  existing  clus¬ 
ter  -  offered  in  product  form  by  Univa  (UniCloud) 
and  Sun  (Cloud  Adapter  in  Hedeby/SDM). 

This  last  option  automates  the  booting/termination  of 
EC2  nodes  based  on  queuing  system  demand,  further 
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TABLE  4 

pert/pemodel  performance  (time  to  completion  in 
seconds)  on  various  EC2  instance  types  -  Opt  stands  for 
Opteron 


site 

processor  type 

pert 

pemodel 

cores 

ml.  small 

Opt  DC  2.6GHz 

13.53 

2850.14 

0.5 

ml. large 

Opt  DC  2.0GHz 

9.33 

1817.13 

2 

ml.xlarge 

Opt  DC  2.0GHz 

9.14 

1860.81 

4 

cl. medium 

Core2  2.33GHz 

9.80 

1008.11 

2 

cl.xlarge 

Core2  2.33GHz 

6.67 

1030.42 

8 

m2.2xlarge 

Nehalem  2.67GHz 

3.39 

779.77 

4 

m2.4xlarge 

Nehalem  2.67GHz 

3.35 

790.86 

8 

minimizing  costs.  Most  of  the  options  allow  for  minimal 
changes  to  the  generic  SGE  setup. 

5. 6.2  I/O  and  computational  issues 

The  I/O  issues  of  the  Amazon  EC2  option  are  similar 
to  the  Grid  ones  but  compounded  by  the  fact  that 
neither  the  networking  nor  the  disk  hardware  are  geared 
towards  high  performance  computing.  Similar  solutions 
can  be  adopted,  with  an  emphasis  on  avoiding  issues 
resulting  from  the  relatively  low  network  bandwidth  of 
EC2  to  the  outside  world.  Any  common  staging  areas  can 
be  provided  either  via  NES  exporting  a  persistent  EBS 
volume  or  populating  an  on-demand  created  parallel 
filesystem  with  data  from  EBS.  In  the  latter  case  extra 
work  needs  to  be  made  to  ensure  that  the  AMIs  can 
function  as  clients  for  the  parallel  filesystem.  Given  the 
issues  with  NES  performance  discussed  in  Section  5  NES 
staging  areas  should  be  abused  for  massive  concurrent 
I/O  from  remote  nodes,  however  appealing  the  simplic¬ 
ity  of  such  an  approach  may  appear. 

An  idea  of  the  times  of  several  EC2  instances  running 
pemodel  and  pert  for  various  instance  types  is  shown 
in  Table  4.  As  of  the  last  few  months  of  2009  the  new 
Amazon  datacenters  have  introduced  new  varieties  of 
actual  hardware  for  the  various  EC2  instance  types  - 
hence  the  ml. small  (standard)  instance  can  be  either 
AMD  Opteron  -  2.0GHz  or  2.6GHz  dual  core  (DC)  based 
-  or  Intel  (Core2)  Xeon  54xx  series  (Penryn  generation) 
quad  core  (QC)  based.  The  latter  type  of  hardware  can 
offer  even  better  performance  that  the  one  shown  in 
Table  4.  Amazon  recently  introduced  Intel  Nehalem- 
based  instance  types  with  increased  memory  capacity, 
effectively  allowing  all  of  the  ESSE  calculations  to  be 
conducted  in  the  Cloud,  even  for  very  large  ensemble 
sizes  as  the  nodes  have  enough  memory  to  handle  the 
SVD  of  extremely  large  matrices. 

In  all  the  cases  shown  the  instance  type  was  fully 
utilized  (ie.  8  copies  of  pert/pemodel  were  run  con¬ 
currently  on  a  cl.xlarge  instance.  The  ml. small  instance 
appears  as  a  1  core  but  is  in  fact  limited  to  a  maximum  of 
50%  (or  even  40%  for  newer  cpu  types)  cpu  utilization, 
hence  appearing  as  a  half-core.  The  executables  (and 
software  environment)  were  identical  to  those  on  the 
home  cluster.  In  each  case  the  worst  time  of  the  batch 
is  being  reported.  Despite  the  faster  Nehalem  generation 


processors  available  on  the  m2  series  instances,  the  most 
cost-effective  approach  balancing  price  and  performance 
appears  to  be  cl. medium  (or  cl.xlarge  if  one  wants  to 
maximize  the  number  of  execution  cores  for  a  given 
number  of  instances). 

Cost-wise  for  example  an  ESSE  calculation  with  1.5GB 
input  data,  960  ensemble  members  each  sending  back 
11MB  (for  a  total  of  6.6GB)  would  cost:  1.5{GB)  x  0.1  + 
10.56(G5)  X  0.17  +  2{hr)  *  20  *  0.68  =  $29.15  Use  of 
reserved  or  spot  instances  would  drop  pricing  for  the 
cpu  usage  by  a  bit  less  than  a  factor  of  3. 

5.6.3  Evaluating  ESSE  on  EC2 

There  are  quite  a  few  clear  advantages  to  using  EC2  for 
larger  ESSE  ensembles: 

•  Eor  all  intents  and  purposes  the  response  is  imme¬ 
diate.  EC2's  capacity  is  large  enough  that  a  request 
for  a  virtual  EC2  cluster  gets  satisfied  on-demand, 
without  having  to  worry  about  queue  times  and 
backfill  slots. 

•  The  use  of  virtual  machines  allows  for  deploying  the 
same  environment  as  the  home  cluster.  This  provides 
for  a  very  clean  integration  of  the  two  clusters. 

•  Having  the  same  software  environment  also  results 
in  no  need  to  rebuild  (and  in  most  cases  having  to 
revalidate)  executables.  This  means  that  last  minute 
changes  (because  of  model  build-time  parameter 
tuning)  can  be  used  ASAP  instead  of  having  to  go 
through  a  build-test-deploy  cycle  on  each  remote 
platform. 

•  EC2  allows  our  virtual  clusters  to  scale  at  will:  There 
is  a  default  20  instance  limit  (which  correspond  to  a 
maximum  configuration  of  160  cores)  but  if  needed 
it  can  be  increased  upon  request. 

•  Since  the  remote  machines  are  under  our  complete 
control,  scheduling  software  and  policies  etc.  can  be 
tuned  exactly  to  our  needs. 

At  the  same  time  use  of  EC2  is  not  without  it's  problems: 

•  Unlike  the  case  of  shared  state  or  national  resources 
that  come  out  of  research  grant  related  allocations, 
EC2  usage  needs  to  be  directly  paid  to  Amazon. 

•  Amazon  charges  by  the  hour  -  much  like  cell-phone 
charges  usage  of  1  hour  1  sec.  counts  as  2  hours. 
Moreover  Amazon  charges  for  data  movement  in 
and  out  of  EC2. 

•  The  performance  of  virtual  machines  is  less  than 
that  of  "bare  metal",  the  difference  being  more 
pronounced  when  it  comes  to  I/O. 

•  Unlike  purpose-build  parallel  clusters,  EC2  does 
not  offer  a  persistent  large  parallel  filesystem.  One 
can  be  constructed  on  demand  (just  like  the  virtual 
clusters)  but  the  Gigabit  Ethernet  connectivity  used 
throughout  Amazon  EC2  alongside  the  random¬ 
ization  of  instance  placement  mean  that  parallel 
performance  of  the  filesystem  is  not  up  to  par. 

•  Moreover,  unlike  national  and  state  supercomputing 
facilities,  Amazon's  connections  to  the  home  cluster 
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are  bound  to  be  slower  and  result  in  file  transfer 
delays. 

The  substandard  (single  port  Gigabit  Ethernet  based) 
interconnect  that  an  EC2  virtual  cluster  provides  should 
not  be  so  much  of  an  issue  for  future  ESSE  ensembles 
employing  nested  calculations:  Two-way  nesting  would 
be  run  on  2-core  instances  (or  two. four  of  them  could  be 
"packed"  on  4 /8-core  instances),  utilizing  shared  mem¬ 
ory  for  fast  communications  between  the  nested  models. 
Even  the  far  less  commonly  used  three-way  nesting 
could  run  faster  on  oversubscribed  2-core  instances  than 
spread  over  multiple  nodes. 

6  Example  ESSE  results  for  Monterey 
Bay 

A  large  Office  of  Naval  Research  (ONR)-sponsored, 
multi-institution  coastal  predictive  skill  exercise,the  Au¬ 
tonomous  Ocean  Sampling  Network-II  (AOSN-11),  oc¬ 
curred  in  August-September  2003  in  the  Monterey  Bay 
region  off  central  California. The  goal  of  this  exercise 
was  to  initiate  at-sea  research  of  an  adaptive  observing 
and  prediction  system,  with  the  intent  to  assimilate 
various  data  types,  adapt  the  deployment  of  platforms 
and  allow  the  relocation  of  the  system  to  other  regions. 
The  Harvard  Ocean  Prediction  System(HOPS)  and  Er¬ 
ror  Subspace  Statistical  Estimation(ESSE)  system  were 
utilized  in  real-time  to  forecast  physical  fields  and  un¬ 
certainties,  assimilate  various  ocean  measurements(CTD, 
AUVs,  gliders  and  SSTdata),  provide  suggestions  for 
adaptive  sampling,  and  guide  dynamical  investigations. 

To  exercise  our  new  MTC  implementation  of  ESSE,  we 
repeated  the  calculations  of  AOSN-II.  The  ESSE  forecast 
for  September  5,  00:00  GMT  was  initialized  from  an  error 
nowcast  for  September  3,  00:00  GMT.  The  background 
ocean  field  on  September  3,  00:00  GMT  is  a  HOPS 
forecast  simulation  which  assimilates  all  available  and 
calibrated  data  up  to  September  2,  10:00GMT. 

The  dominant  600  eigenvectors  of  the  posterior  error 
covariance  estimate  for  September  3,  OOOOGMT  were 
utilized  to  perturb  the  ocean  fields.  A  white  noise  of 
an  amplitude  proportional  to  the  estimated  absolute 
and  relative  errors  in  the  observations  is  added  to  this 
random  combination,  in  part  to  represent  the  errors 
truncated  by  the  error  subspace.  An  ensemble  of  forecast 
simulations,  each  forced  by  forecast  COAMPS  atmo¬ 
spheric  fluxes  issued  on  September  2,  was  then  carried 
out.  These  ensemble  members  were  then  utilized  to 
compute  the  standard  deviations  shown  in  Pigs.  10,11. 

7  Future  work 

We  plan  to  fine  tune  our  ESSE  workflows  for  production 
using  the  Teragrid  as  well  as  test  them  on  the  Open 
Science  Grid.  We  would  like  to  investigate  the  efficacy  of 
a  data  scheduler  such  as  Stork  to  help  us  with  prestaging 
input  data.  We  also  plan  to  test  the  feasibility  of  a  mixed 
local/ Grid /EC2  run  employing  MyCluster.  Puture  more 
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Fig.  10.  ESSE  uncertainty  forecast  for  sea-surface  tem¬ 
perature  {°C) 
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Fig.  1 1 .  ESSE  uncertainty  forecast  for  30m  temperature 
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involved  experiments  are  expected  to  scale  from  1000  to 
10000  or  more  ESSE  ensemble  members  (and  even  more 
acoustic  calculations).  We  are  interested  in  seeing  how 
queuing  systems  and  resource  managers  handle  such 
a  workload  in  a  short  time  interval.  Eurthermore  more 
realistic  model  setups  are  expected  to  require  the  use  of 
nested  HOPS  calculations  which  are  executed  in  parallel 
-  thereby  introducing  the  concept  of  massive  ensembles 
of  small  (2-3  task)  MPI  jobs.  We  plan  to  simplify  the  use 
of  such  setups  via  the  use  of  an  XML  driven  validating 
graphical  user  interface  [24]. 

Another  area  where  MTC  would  be  most  valuable 
is  the  intelligent  coordination  of  autonomous  ocean 
sampling  networks.  To  achieve  optimal  and  adaptive 
sampling  [25],  [26],  [27],  [28],  [29],  large-dimensional 
nonlinear  stochastic  optimizations,  artificial  intelligence 
and  advance  Markovian  estimation  systems  can  be  re¬ 
quired.  Such  complex  systems  are  prime  examples  of 
MTC  problems  that  can  be  combined  with  our  uncer¬ 
tainty  estimations  [30]. 

8  Conclusion 

We  described  a  new  type  of  Many-Task  Computing 
application  that  is  very  relevant  to  Earth  and  Envi¬ 
ronmental  Science  applications  (and  prototypical  of  a 
general  class  of  ensemble-based  forecasting  and  esti¬ 
mation  methods).  We  introduced  the  concept  of  ocean 
data  assimilation,  discussed  the  ESSE  algorithm  and 
described  its  MTC  implementation  (and  its  variations 
along  with  their  justification).  Results  on  a  local  cluster 
were  presented  along  with  a  discussion  of  the  challenges 
of  scaling  out  and  solutions  for  doing  so  employing 
Grids  and  Clouds.  I/O  locality  issues  are  among  our 
main  concern.  We  believe  that  this  type  of  ensemble 
based  forecast  workflows  can  in  the  future  represent  an 
important  new  class  of  MTC  applications. 
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